Release v1.8.0. (Installation) Welcome to the documentation for the internetarchive Python library.

internetarchive is a command-line and Python interface to archive.org. Please report any issues on GitHub.

If you’re not sure where to begin, the quickest and easiest way to get started is downloading a binary and taking a look 
at the command-line interface documentation. 
Contents



CHAPTER 1 


User’s Guide 


Installation 

System-Wide Installation 

Installing the internetarchive library globally on your system can be done with pip. This is the recommended 
method for installing internetarchive (see below for details on installing pip): 

$ sudo pip install internetarchive 

or, with easy_install: 

$ sudo easy_install internetarchive 

Either of these commands will install the internetarchive Python library and ia command-line tool on your 
system. 

Note: Some versions of Mac OS X come with Python libraries that are required by internetarchive (e.g. the 
Python package six). This can cause installation issues. If your installation is failing with a message that looks 
something like: 

OSError: [Errno 1] Operation not permitted: '/var/folders/bk/3wx7qs8d0x79tqbmcdmskl04Q000gp/T/pip-TG? 

You can use the --ignore-installed parameter in pip to ignore the libraries that are already installed, and
continue with the rest of the installation:

$ sudo pip install --ignore-installed internetarchive

More details on this issue can be found here: https://github.com/pypa/pip/issues/3165 

Installing Pip 

The easiest way to install pip is probably using your operating system's package manager.

Mac OS, with homebrew: 

$ brew install pip 
Ubuntu, with apt-get: 

$ sudo apt-get install python-pip 

If your OS doesn’t have a package manager, you can also install pip with get-pip.py:

$ curl -LOs https://bootstrap.pypa.io/get-pip.py
$ python get-pip.py 

virtualenv 

If you don’t want to, or can’t, install the package system-wide you can use virtualenv to create an isolated Python 
environment. 

First, make sure virtualenv is installed on your system. If it’s not, you can do so with pip: 

$ sudo pip install virtualenv 
With easy_install: 

$ sudo easy_install virtualenv 

Or your system's package manager, apt-get for example:

$ sudo apt-get install python-virtualenv 

Once you have virtualenv installed on your system, create a virtualenv: 

$ mkdir myproject 
$ cd myproject 
$ virtualenv venv 

New python executable in venv/bin/python 
Installing setuptools, pip...done.

Activate your virtualenv: 

$ . venv/bin/activate 

Install internetarchive into your virtualenv: 

$ pip install internetarchive 

Snap 

You can install the latest ia snap, and help test the most recent changes on the master branch on all supported
Linux distros, with:

$ sudo snap install ia --edge

Every time a new version of ia is pushed to the store, it will be updated automatically.


Binaries 

Binaries are also available for the ia command-line tool: 

$ curl -LOs https://archive.org/download/ia-pex/ia
$ chmod +x ia 

Binaries are generated with PEX. The only requirement for using the binaries is that you have Python installed on
a Unix-like operating system.

For more details on the command-line interface please refer to the README, or ia help. 
Get the Code

Internetarchive is actively developed on GitHub. 

You can either clone the public repository: 

$ git clone git://github.com/jjjake/internetarchive.git 

Download the tarball: 

$ curl -OL https://github.com/jjjake/internetarchive/tarball/master

Or, download the zipball:

$ curl -OL https://github.com/jjjake/internetarchive/zipball/master

Once you have a copy of the source, you can install it into your site-packages easily: 

$ python setup.py install 

Quickstart 

Configuring 

Certain functionality of the internetarchive Python library requires your archive.org credentials. Your IA-S3 keys are 
required for uploading, searching, and modifying metadata, and your archive.org logged-in cookies are required for 
downloading access-restricted content and viewing your task history. To automatically create a config file with your 
archive.org credentials, you can use the ia command-line tool: 

$ ia configure 

Enter your archive.org credentials below to configure 'ia'.

Email address: user@example.com 
Password: 

Config saved to: /home/user/.config/ia.ini 

Your config file will be saved to $HOME/.config/ia.ini, or $HOME/.ia if you do not have a .config
directory in $HOME. Alternatively, you can specify your own path to save the config to via ia --config-file
'~/.ia-custom-config' configure.

If you have a netrc file with your archive.org credentials in it, you can simply run ia configure --netrc. Note
that Python’s netrc library does not currently support passphrases, or passwords with spaces in them, so these are
not currently supported here.
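The lookup order for the config file described above can be sketched in a few lines of Python. This is an illustration only: default_config_path is a hypothetical helper, not part of the internetarchive API, and it assumes the rule is exactly as stated (use .config/ia.ini when a .config directory exists, otherwise fall back to .ia).

```python
import os


def default_config_path(home):
    """Return the config path for a given home directory (hypothetical helper).

    Assumption: <home>/.config/ia.ini is used when <home>/.config exists,
    otherwise <home>/.ia, as described in the text.
    """
    config_dir = os.path.join(home, ".config")
    if os.path.isdir(config_dir):
        return os.path.join(config_dir, "ia.ini")
    return os.path.join(home, ".ia")
```

Calling default_config_path(os.path.expanduser("~")) would give the path that applies to the current user under these assumptions.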


Uploading 

Creating a new item on archive.org and uploading files to it is as easy as: 

>>> from internetarchive import upload
>>> md = dict(collection='test_collection', title='My New Item', mediatype='movies')
>>> r = upload('<identifier>', files=['foo.txt', 'bar.mov'], metadata=md)
>>> r[0].status_code
200

You can set the remote filename using a dictionary:

>>> r = upload('<identifier>', files={'remote-name.txt': 'local-name.txt'})

You can upload file-like objects:

>>> from io import StringIO
>>> r = upload('iacli-test-item301', {'foo.txt': StringIO(u'bar baz boo')})

If the item already has a file with the same filename, the existing file within the item will be overwritten. 

upload can also upload directories. For example, the following command will upload my_dir and all of its contents
to https://archive.org/download/my_item/my_dir/:

>>> r = upload('my_item', 'my_dir')

To upload only the contents of the directory, but not the directory itself, simply append a slash to your directory:

>>> r = upload('my_item', 'my_dir/')

This will upload all of the contents of my_dir to https://archive.org/download/my_item/. upload
accepts relative or absolute paths.
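The effect of the trailing slash can be illustrated with a small sketch that maps local files to the remote names they would end up under. remote_names is a hypothetical helper written for this example; it is not how the library is implemented, and it assumes the behaviour is exactly as described above (no slash keeps the directory name as a prefix, a trailing slash uploads the contents only).

```python
import os


def remote_names(local_dir):
    """Map local file paths under local_dir to remote names (a sketch).

    Assumption: a trailing slash means "contents only"; without it the
    directory name itself is kept as a prefix.
    """
    contents_only = local_dir.endswith("/")
    root = os.path.normpath(local_dir)
    prefix = "" if contents_only else os.path.basename(root)
    mapping = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            local_path = os.path.join(dirpath, name)
            # Remote names always use forward slashes.
            rel = os.path.relpath(local_path, root).replace(os.sep, "/")
            mapping[local_path] = f"{prefix}/{rel}" if prefix else rel
    return mapping
```

With a directory my_dir containing a.txt, remote_names('my_dir') maps it to my_dir/a.txt, while remote_names('my_dir/') maps it to a.txt.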

Note: metadata can only be added to an item using the upload function on item creation. If an item already exists
and you would like to modify its metadata, you must use modify_metadata.


Metadata 

Reading Metadata 

You can access all of an item’s metadata via the Item object: 

>>> from internetarchive import get_item
>>> item = get_item('iacli-test-item301')
>>> item.item_metadata['metadata']['title']
'My Title'

get_item retrieves all of an item’s metadata via the Internet Archive Metadata API. This metadata can be accessed
via the Item.item_metadata attribute:

>>> item.item_metadata.keys()
dict_keys(['created', 'updated', 'd2', 'uniq', 'metadata', 'item_size', 'dir', 'd1', 'fi

All of the top-level keys in item.item_metadata are available as attributes:

>>> item.server
'ia801507.us.archive.org'
>>> item.item_size
15175202.4
>>> item.files[0]['name']
>>> item.metadata['identifier']
'iacli-test-item301'


Writing Metadata 

Adding new metadata to an item can be done using the modify_metadata function: 
>>> from internetarchive import modify_metadata
>>> r = modify_metadata('<identifier>', metadata=dict(title='My Stuff'))
>>> r.status_code
200

Modifying metadata can also be done via the Item object. For example, changing the title we set in the example 
above can be done like so: 

>>> r = item.modify_metadata(dict(title='My New Title'))
>>> item.metadata['title']
'My New Title'

To remove a metadata field from an item’s metadata, set the value to 'REMOVE_TAG':

>>> r = item.modify_metadata(dict(foo='new metadata field.'))
>>> item.metadata['foo']
'new metadata field.'
>>> r = item.modify_metadata(dict(foo='REMOVE_TAG'))
>>> print(item.metadata.get('foo'))
None
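To make the REMOVE_TAG convention concrete, here is a purely local sketch of how a set of changes would affect a metadata dictionary. apply_changes is a hypothetical function for illustration; the real change is applied by archive.org, not by code like this.

```python
def apply_changes(metadata, changes):
    """Return a copy of metadata with changes applied (illustrative only).

    A value of 'REMOVE_TAG' removes the field; any other value sets it.
    """
    updated = dict(metadata)
    for key, value in changes.items():
        if value == "REMOVE_TAG":
            updated.pop(key, None)
        else:
            updated[key] = value
    return updated
```

For example, applying dict(foo='REMOVE_TAG') to a dictionary that has a foo key returns a copy without it, and leaves every other field untouched.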

The default behaviour of modify_metadata is to modify item-level metadata (i.e. title, description, etc.). If we
want to modify different kinds of metadata, say the metadata of a specific file, we have to change the metadata target
in the call to modify_metadata:

>>> r = item.modify_metadata(dict(title='My File Title'), target='files/foo.txt')
>>> f = item.get_file('foo.txt')
>>> f.title
'My File Title'

Refer to Internet Archive Metadata for more specific details regarding metadata and archive.org. 


Downloading 

Downloading files can be done via the download function:

>>> from internetarchive import download
>>> download('nasa', verbose=True)
downloaded nasa/globe_west_540.jpg to nasa/globe_west_540.jpg
downloaded nasa/NASAarchiveLogo.jpg to nasa/NASAarchiveLogo.jpg
downloaded nasa/globe_west_540_thumb.jpg to nasa/globe_west_540_thumb.jpg
downloaded nasa/nasa_reviews.xml to nasa/nasa_reviews.xml
downloaded nasa/nasa_meta.xml to nasa/nasa_meta.xml
downloaded nasa/nasa_archive.torrent to nasa/nasa_archive.torrent
downloaded nasa/nasa_files.xml to nasa/nasa_files.xml

By default, the download function sets the mtime for downloaded files to the mtime of the file on archive.org.
If we retry downloading the same set of files we downloaded above, no requests will be made. This is because the
filename, mtime and size of the local files match the filename, mtime and size of the files on archive.org, so we assume
that the file has already been downloaded. For example:

>>> download('nasa', verbose=True)
skipping nasa/globe_west_540.jpg, file already exists based on length and date.
skipping nasa/NASAarchiveLogo.jpg, file already exists based on length and date.
skipping nasa/globe_west_540_thumb.jpg, file already exists based on length and date.
skipping nasa/nasa_reviews.xml, file already exists based on length and date.




skipping nasa/nasa_meta.xml, file already exists based on length and date.
skipping nasa/nasa_archive.torrent, file already exists based on length and date.
skipping nasa/nasa_files.xml, file already exists based on length and date.
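The skip decision described above (same size and same mtime means the file is assumed to be present already) can be sketched as follows. already_downloaded and its parameters are hypothetical names for illustration; remote_size and remote_mtime stand in for the values archive.org reports for a file, and the library's own check may differ in detail.

```python
import os


def already_downloaded(local_path, remote_size, remote_mtime):
    """Return True if a local file matches the remote size and mtime.

    remote_size is in bytes and remote_mtime in Unix seconds; both are
    stand-ins for the values archive.org reports (an assumption).
    """
    if not os.path.exists(local_path):
        return False
    st = os.stat(local_path)
    return st.st_size == int(remote_size) and int(st.st_mtime) == int(remote_mtime)
```

This is cheap because it needs only a stat call per file, which is why it is a reasonable default compared with hashing every file.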

Alternatively, you can skip files based on md5 checksums. This will take longer because checksums will need to be
calculated for every file already downloaded, but will be safer:

>>> download('nasa', verbose=True, checksum=True)
skipping nasa/globe_west_540.jpg, file already exists based on checksum.
skipping nasa/NASAarchiveLogo.jpg, file already exists based on checksum.
skipping nasa/globe_west_540_thumb.jpg, file already exists based on checksum.
skipping nasa/nasa_reviews.xml, file already exists based on checksum.
skipping nasa/nasa_meta.xml, file already exists based on checksum.
skipping nasa/nasa_archive.torrent, file already exists based on checksum.
skipping nasa/nasa_files.xml, file already exists based on length and date.
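A checksum comparison of this kind can be sketched with the standard library. The helper names below are hypothetical, and remote_md5 stands in for the hex digest archive.org reports for the file; this shows the general technique, not the library's internal code.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, remote_md5):
    """Return True if the local file's md5 equals remote_md5 (hex string)."""
    return md5_of_file(path) == remote_md5.lower()
```

Reading in chunks keeps memory use flat for large files, but every byte still has to be read, which is the extra cost mentioned above.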

By default, the download function will download all of the files in an item. However, there are a couple of parameters
that can be used to download only specific files. Files can be filtered using the glob_pattern parameter:

>>> download('nasa', verbose=True, glob_pattern='*xml')

downloaded nasa/nasa_reviews.xml to nasa/nasa_reviews.xml 
downloaded nasa/nasa_meta.xml to nasa/nasa_meta.xml 
downloaded nasa/nasa_files.xml to nasa/nasa_files.xml 
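Shell-style glob matching of this kind can be reproduced locally with Python's fnmatch module. The sketch below assumes glob_pattern follows fnmatch-style rules; filter_by_glob is a hypothetical helper, not a library function.

```python
import fnmatch


def filter_by_glob(names, pattern):
    """Return the file names that match a shell-style glob pattern."""
    # fnmatchcase keeps matching case-sensitive on every platform.
    return [name for name in names if fnmatch.fnmatchcase(name, pattern)]
```

Applied to the file names in the nasa item with the pattern '*xml', this keeps only nasa_reviews.xml, nasa_meta.xml and nasa_files.xml.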

Files can also be filtered using the formats parameter. formats can either be a single format provided as a string:

>>> download('goodytwoshoes00newyiala', verbose=True, formats='MARC')
goodytwoshoes00newyiala:
 downloaded goodytwoshoes00newyiala/goodytwoshoes00newyiala_meta.mrc to goodytwoshoes00newyiala/good]

Or, a list of formats:

>>> download('goodytwoshoes00newyiala', verbose=True, formats=['DjVuTXT', 'MARC'])
goodytwoshoes00newyiala:
 downloaded goodytwoshoes00newyiala/goodytwoshoes00newyiala_meta.mrc to goodytwoshoes00newyiala/good]
 downloaded goodytwoshoes00newyiala/goodytwoshoes00newyiala_djvu.txt to goodytwoshoes00newyiala/good]
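Accepting either a single string or a list is a common pattern, and can be sketched as below. filter_by_format is a hypothetical helper, and the example assumes each file entry is a dictionary with 'name' and 'format' keys; the sample entries are made up for illustration.

```python
def filter_by_format(files, formats):
    """Keep file entries whose 'format' is one of the requested formats.

    formats may be a single string or a list of strings.
    """
    if isinstance(formats, str):
        formats = [formats]
    wanted = set(formats)
    return [f for f in files if f.get("format") in wanted]


# Made-up sample entries, for illustration only.
sample_files = [
    {"name": "example_meta.mrc", "format": "MARC"},
    {"name": "example_djvu.txt", "format": "DjVuTXT"},
    {"name": "example.pdf", "format": "Text PDF"},
]
```

Normalising the argument to a list first means the rest of the function only has to handle one case.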


Downloading On-The-Fly Files 

Some files on archive.org are generated on-the-fly as requested. This currently includes non-original files of the formats
EPUB, MOBI, DAISY, and archive.org’s own MARC XML. These files can be downloaded using the on_the_fly
parameter:

>>> download('goodytwoshoes00newyiala', verbose=True, formats='EPUB', on_the_fly=True)
goodytwoshoes00newyiala:
 downloaded goodytwoshoes00newyiala/goodytwoshoes00newyiala.epub to goodytwoshoes00newyiala/goodytwo:


Searching 

The search_items function can be used to iterate through archive.org search results:

>>> from internetarchive import search_items
>>> for i in search_items('identifier:nasa'):
...     print(i['identifier'])
...
nasa

search_items can also yield Item objects:

>>> from internetarchive import search_items
>>> for item in search_items('identifier:nasa').iter_as_items():
...     print(item)
...
Collection(identifier='nasa', exists=True)

search_items will automatically paginate through large result sets.


Command-Line Interface 


The ia command-line tool is installed with internetarchive, or available as a binary. ia allows you to interact
with various archive.org services from the command-line.


Getting Started 

The easiest way to start using ia is downloading a binary. The only requirements of the binary are a Unix-like
environment with Python installed. To download the latest binary and make it executable, simply run:

$ curl -LOs https://archive.org/download/ia-pex/ia
$ chmod +x ia
$ ./ia help
A command line interface to archive.org.

    ia [--help | --version]
    ia [--config-file FILE] [--log | --debug] [--insecure] <command> [<args>]...

    -v, --version
    -c, --config-file FILE  Use FILE as config file.
    -l, --log               Turn on logging [default: False].
    -d, --debug             Turn on verbose logging [default: False].
    -i, --insecure          Use HTTP for all requests instead of HTTPS [default: false]

commands:
    help       Retrieve help for subcommands.
    configure  Configure 'ia'.
    metadata   Retrieve and modify metadata for items on archive.org.
    upload     Upload items to archive.org.
    download   Download files from archive.org.
    delete     Delete files from archive.org.
    search     Search archive.org.
    tasks      Retrieve information about your archive.org catalog tasks.
    list       List files in a given item.

See 'ia help <command>' for more information on a specific command.


Metadata 
Reading Metadata

You can use ia to read and write metadata from archive.org. To retrieve all of an item’s metadata in JSON, simply:

$ ia metadata TripDown1905

A particularly useful tool to use alongside ia is jq. jq is a command-line tool for parsing JSON. For example:

$ ia metadata TripDown1905 | jq '.metadata.date'
"1906"

Modifying Metadata 

Once ia has been configured, you can modify metadata:

$ ia metadata <identifier> --modify="foo:bar" --modify="baz:foooo"

You can remove a metadata field by setting the value of the given field to REMOVE_TAG. For example, to remove the
metadata field foo from the item <identifier>:

$ ia metadata <identifier> --modify="foo:REMOVE_TAG"

Note that some metadata fields (e.g. mediatype) cannot be modified, and must instead be set initially on upload. 

The default target to write to is metadata. If you would like to write to another target, such as files, you can
specify so using the --target parameter. For example, if we had an item whose identifier was my_identifier
and we wanted to add a metadata field to a file within the item called foo.txt:

$ ia metadata my_identifier --target="files/foo.txt" --modify="title:My File"

You can also create new targets if they don’t exist:

$ ia metadata <identifier> --target="extra_metadata" --modify="foo:bar"

There is also an --append option which allows you to append a string to an existing metadata string (Note: use
--append-list for appending elements to a list). For example, if your item’s title was Foo and you wanted it to be
Foo Bar, you could simply do:

$ ia metadata <identifier> --append="title: Bar"

If you would like to add a new value to an existing field that is an array (like subject or collection), you can
use the --append-list option:

$ ia metadata <identifier> --append-list="subject:another subject"

This command would append another subject to the item’s list of subjects, if it doesn’t already exist (i.e. no
duplicate elements are added).

Metadata fields or elements can be removed with the --remove option:

$ ia metadata <identifier> --remove="subject:another subject"

This would remove another subject from the item’s subject field, regardless of whether or not the field is a single
or multi-value field.
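The list semantics described above (append without duplicates, remove from either a single or a multi-value field) can be illustrated with a local sketch. Both functions are hypothetical and operate on a plain dictionary; dropping a field once its last value is removed is an assumption made for this example.

```python
def append_to_list(metadata, field, value):
    """Append value to a list field unless it is already present."""
    current = metadata.get(field, [])
    if not isinstance(current, list):
        current = [current]  # promote a single value to a list
    if value not in current:
        current = current + [value]
    metadata[field] = current
    return metadata


def remove_value(metadata, field, value):
    """Remove value from a single-value or multi-value field."""
    current = metadata.get(field)
    if isinstance(current, list):
        remaining = [v for v in current if v != value]
        if remaining:
            metadata[field] = remaining
        else:
            metadata.pop(field, None)  # assumption: empty fields are dropped
    elif current == value:
        metadata.pop(field, None)
    return metadata
```

Appending the same subject twice leaves the list unchanged the second time, which mirrors the "no duplicate elements" rule.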

Refer to Internet Archive Metadata for more specific details regarding metadata and archive.org. 
Modifying Metadata in Bulk

If you have a lot of metadata changes to submit, you can use a CSV spreadsheet to submit many changes with a single
command. Your CSV must contain an identifier column, with one item per row. Any other column added will
be treated as a metadata field to modify. If no value is provided in a given row for a column, no changes will be
submitted. If you would like to specify multiple values for certain fields, an index can be provided: subject[0],
subject[1]. Your CSV file should be UTF-8 encoded. See metadata.csv for an example CSV file.

Once you’re ready to submit your changes, you can submit them like so:

$ ia metadata --spreadsheet=metadata.csv

See ia help metadata for more details.
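To show how such a spreadsheet maps onto per-item changes, here is a sketch that parses CSV text into (identifier, metadata) pairs. rows_to_metadata is a hypothetical helper and only approximates the rules stated above: empty cells are skipped, and indexed columns such as subject[0] and subject[1] are collected into a list.

```python
import csv
import io
import re

# Matches indexed column names such as "subject[0]".
INDEXED = re.compile(r"^(?P<field>[^\[\]]+)\[(?P<index>\d+)\]$")


def rows_to_metadata(csv_text):
    """Yield (identifier, metadata) pairs from CSV text (a sketch)."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        identifier = row.pop("identifier")
        metadata = {}
        indexed = {}
        for column, value in row.items():
            if column is None or not value:
                continue  # empty cell: no change for this field
            match = INDEXED.match(column)
            if match:
                indexed.setdefault(match["field"], {})[int(match["index"])] = value
            else:
                metadata[column] = value
        for field, values in indexed.items():
            metadata[field] = [values[i] for i in sorted(values)]
        yield identifier, metadata
```

Each yielded pair could then be passed to a call such as modify_metadata(identifier, metadata=metadata), one request per row.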

Upload 

ia can also be used to upload items to archive.org. After configuring ia, you can upload files like so:

$ ia upload <identifier> file1 file2 --metadata="mediatype:texts" --metadata="blah:arg"

Please note that, unless specified otherwise, items will be uploaded with a data mediatype. This
cannot be changed afterwards. Therefore, you should specify a mediatype when uploading, e.g.
--metadata="mediatype:movies"

You can upload files from stdin:

$ curl http://dumps.wikimedia.org/kywiki/20130927/kywiki-20130927-pages-logging.xml.gz \
  | ia upload <identifier> - --remote-name=kywiki-20130927-pages-logging.xml.gz --metadata="title:Up:

You can use the --retries parameter to retry on errors (e.g. if IA-S3 is overloaded):

$ ia upload <identifier> file1 --retries 10

Note that ia upload makes a backup of any files that are clobbered. They are saved to a directory in the item named
history/files/. The files are named in the format $key.~N~. These files can be deleted like normal files. You
can also prevent the backup from happening on clobbers by adding -H x-archive-keep-old-version:0 to
your command.

Refer to archive.org Identifiers for more information on creating valid archive.org identifiers. Please also read the 
Internet Archive Items page before getting started. 

Bulk Uploading 

Uploading in bulk can be done similarly to Modifying Metadata in Bulk. The only difference is that you must provide 
a file column which contains a relative or absolute path to your file. Please see uploading.csv for an example. 

Once you are ready to start your upload, simply run: 

$ ia upload --spreadsheet=uploading.csv

See ia help upload for more details.

Download 

Download an entire item: 




$ ia download TripDown1905

Download specific files from an item:

$ ia download TripDown1905 TripDown1905_512kb.mp4 TripDown1905.ogv

Download specific files matching a glob pattern:

$ ia download TripDown1905 --glob="*.mp4"

Note that you may have to escape the * differently depending on your shell (e.g. \*.mp4, '*.mp4', etc.).

Download only files of a specific format:

$ ia download TripDown1905 --format='512Kb MPEG4'

Note that --format cannot be used with --glob. You can get a list of the formats of a given item like so:

$ ia metadata --formats TripDown1905

Download an entire collection:

$ ia download --search 'collection:glasgowschoolofart'

Download from an itemlist:

$ ia download --itemlist itemlist.txt

See ia help download for more details.

Downloading On-The-Fly Files 

Some files on archive.org are generated on-the-fly as requested. This currently includes non-original files of the
formats EPUB, MOBI, DAISY, and archive.org’s own MARC XML. These files can be downloaded using the
--on-the-fly parameter:

$ ia download goodytwoshoes00newyiala --on-the-fly

Delete 

You can use ia to delete files from archive.org items: 

$ ia delete <identifier> <file> 

Delete a file and all files derived from the specified file:

$ ia delete <identifier> <file> --cascade

Delete all files in an item:

$ ia delete <identifier> --all

Note that ia delete makes a backup of any files that are deleted. They are saved to a directory in the item named
history/files/. The files are named in the format $key.~N~. These files can be deleted like normal files. You
can also prevent the backup from happening on deletes by adding -H x-archive-keep-old-version:0 to
your command.

See ia help delete for more details. 
Search

ia can also be used for retrieving archive.org search results in JSON:

$ ia search 'subject:"market street" collection:prelinger'

By default, ia search attempts to return all items meeting the search criteria, and the results are sorted by item
identifier. If you want to just select the top n items, you can specify a page and rows parameter. For example, to get
the top 20 items matching the search ‘dogs’:

$ ia search --parameters="page=1&rows=20" "dogs"

You can use ia search to create an itemlist:

$ ia search 'collection:glasgowschoolofart' --itemlist > itemlist.txt

You can pipe your itemlist into a GNU Parallel command to download items concurrently:

$ ia search 'collection:glasgowschoolofart' --itemlist | parallel 'ia download {}'

See ia help search for more details.

Tasks 

You can also use ia to retrieve information about your catalog tasks, after configuring ia. To retrieve the task history 
for an item, simply run: 

$ ia tasks <identifier>

View all of your queued and running archive.org tasks: 

$ ia tasks 

See ia help tasks for more details. 

List 

You can list files in an item like so:

$ ia list goodytwoshoes00newyiala

See ia help list for more details.

Copy

You can copy files in archive.org items like so:

$ ia copy <src-identifier>/<src-filename> <dest-identifier>/<dest-filename>

If you’re copying your file to a new item, you can provide metadata as well:

$ ia copy <src-identifier>/<src-filename> <dest-identifier>/<dest-filename> --metadata 'title:My New

Note that ia copy makes a backup of any files that are clobbered. They are saved to a directory in the item named
history/files/. The files are named in the format $key.~N~. These files can be deleted like normal files. You
can also prevent the backup from happening on clobbers by adding -H x-archive-keep-old-version:0 to
your command.
Move

ia move works just like ia copy except the source file is deleted after the file has been successfully copied.

Note that ia move makes a backup of any files that are clobbered or deleted. They are saved to a directory
in the item named history/files/. The files are named in the format $key.~N~. These files can be
deleted like normal files. You can also prevent the backup from happening on clobbers or deletes by adding -H
x-archive-keep-old-version:0 to your command.


Internet Archive Items 

What Is an Item? 

Archive.org is made up of “items”. An item is a logical “thing” that we represent on one web page on archive.org. 
An item can be considered as a group of files that deserve their own metadata. If the files in an item have separate 
metadata, the files should probably be in different items. An item can be a book, a song, an album, a dataset, a movie, 
an image or set of images, etc. Every item has an identifier that is unique across archive.org. 


How Items Are Structured 

An item is just a directory of files and possibly subdirectories. Every item has at least two files named in the following 
format (see metadata page for more context on what an identifier is): 

• <identifier>_files.xml

• <identifier>_meta.xml

The _meta.xml file is an XML file containing all of the metadata describing the item. The _files.xml file is an
XML file containing all of the file-level metadata. There can only be one _meta.xml file and one _files.xml
file per item.

Alongside these metadata files and the original files uploaded to the item, the item may also contain derivative files 
automatically generated by archive.org. 


Item Limitations 

As a rule of thumb, items should: 

• not be over 100GB 

• not contain more than 10,000 files. 
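Before uploading a local directory as a single item, these rules of thumb can be checked with a short script. This is a sketch under stated assumptions: within_item_limits is a hypothetical helper, and "100GB" is taken here to mean 100 × 10^9 bytes, which the text does not specify.

```python
import os

MAX_BYTES = 100 * 10**9  # "100GB", read as decimal gigabytes (an assumption)
MAX_FILES = 10_000


def within_item_limits(directory, max_bytes=MAX_BYTES, max_files=MAX_FILES):
    """Return True if a directory tree stays within the suggested limits."""
    total_bytes = 0
    total_files = 0
    for dirpath, _dirnames, filenames in os.walk(directory):
        for name in filenames:
            total_files += 1
            total_bytes += os.path.getsize(os.path.join(dirpath, name))
    return total_bytes <= max_bytes and total_files <= max_files
```

If the check fails, splitting the material across several items is the usual way to stay within the guidance.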


Collections 

All items must be part of a collection. A collection is simply an item with special characteristics. Besides an image
file for the collection logo, files should never be uploaded directly to a collection item. Items can be assigned to
a collection at the time of creation, or after the item has been created by modifying the collection element
in an item’s metadata to contain the identifier for the given collection (e.g. ia metadata <identifier> -m
collection:<collection-identifier>). Currently collections can only be created by archive.org staff.
Please contact info@archive.org if you need a collection.
Archival URLs

An item’s “details” page will always be available at:

https://archive.org/details/<identifier>

The item directory is always available at: 

https://archive.org/download/<identifier> 

A particular file can always be downloaded from: 

https://archive.org/download/<identifier>/<filename> 

Note: Archival URLs may redirect to an actual server that contains the content. The resultant URL is not a permalink. 
For example, the archival URL: 

https://archive.org/download/popeye_taxi-turvey/popeye_taxi-turvey_meta.xml

currently redirects to:

https://ia802304.us.archive.org/30/items/popeye_taxi-turvey/popeye_taxi-turvey_meta.xml

DO NOT LINK to any archive.org URL that begins with numbers like this. This refers to the particular machine that
we’re serving the file from right now, but we move items to new servers all the time. If you link to this sort of URL,
instead of the archival URL, your link WILL break at some point.
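The three URL patterns above are simple enough to build programmatically, which also helps avoid accidentally storing a server-specific URL. The helpers below are hypothetical names for illustration; they only format strings and inspect the host name, and make no requests.

```python
from urllib.parse import quote, urlparse

BASE = "https://archive.org"


def details_url(identifier):
    """Return the archival URL of an item's details page."""
    return f"{BASE}/details/{quote(identifier)}"


def download_url(identifier, filename=None):
    """Return the archival URL of an item directory or of one file in it."""
    url = f"{BASE}/download/{quote(identifier)}"
    if filename is not None:
        url += "/" + quote(filename)  # "/" in filename is left as-is
    return url


def is_archival_url(url):
    """Return True if url uses the stable archive.org host and paths."""
    parts = urlparse(url)
    return parts.hostname == "archive.org" and parts.path.startswith(
        ("/details/", "/download/")
    )
```

A link checker could use is_archival_url to flag URLs that point at an individual storage server rather than at archive.org itself.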


Internet Archive Metadata 

Metadata is data about data. In the case of Internet Archive items, the metadata describes the contents of the items. 
Metadata can include information such as the performance date for a concert, the name of the artist, and a set list for 
the event. 

Metadata is a very important element of items in the Internet Archive. Metadata allows people to locate and view 
information. Items with little or poor metadata may never be seen and can become lost. 

Note that metadata keys must be valid XML tags. Please refer to the XML Naming Rules section here. 


Archive.org Identifiers 

Each item at Internet Archive has an identifier. An identifier is composed of any unique combination of alphanumeric
characters, underscore (_) and dash (-). While there are no official limits, it is strongly suggested that identifiers be
between 5 and 80 characters in length.

Identifiers must be unique across the entirety of Internet Archive, not simply unique within a single collection.

Once defined, an identifier cannot be changed. It will travel with the item or object and is involved in every manner
of accessing or referring to the item.
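The character and length guidance above can be checked locally before creating an item. check_identifier is a hypothetical helper; it reads "alphanumeric" as ASCII letters and digits (an assumption) and cannot verify uniqueness, which only archive.org can determine.

```python
import re

ALLOWED = re.compile(r"[A-Za-z0-9_-]+")


def check_identifier(identifier):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if not ALLOWED.fullmatch(identifier):
        problems.append("contains characters other than letters, digits, '_' and '-'")
    if not 5 <= len(identifier) <= 80:
        problems.append("length is outside the suggested 5 to 80 characters")
    return problems
```

Returning a list of problems rather than a boolean makes it easy to report every issue at once.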


Standard Internet Archive Metadata Fields 

There are several standard metadata fields recognized for Internet Archive items. Most metadata fields are optional. 
addeddate

Contains the date on which the item was added to Internet Archive. 

Please use an ISO 8601 compatible format for this date. For instance, these are all valid date formats: 

• YYYY 

• YYYY-MM-DD 

• YYYY-MM-DD HH:MM:SS 

While it is possible to set the addeddate metadata value, it is not recommended. This value is typically set by
automated processes.
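The three date layouts listed above can be checked with the standard library. is_accepted_date is a hypothetical helper covering only those three layouts; ISO 8601 allows more forms than this sketch accepts, and strptime is slightly lenient about zero padding.

```python
from datetime import datetime

# The three layouts listed in the text, as strptime patterns.
ACCEPTED_FORMATS = ("%Y", "%Y-%m-%d", "%Y-%m-%d %H:%M:%S")


def is_accepted_date(value):
    """Return True if value parses with one of the listed layouts."""
    for fmt in ACCEPTED_FORMATS:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            continue
        return True
    return False
```

Because strptime validates the calendar as well as the shape, a value such as 2013-13-01 is rejected even though it looks like YYYY-MM-DD.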

adder 

The name of the account which added the item to the Internet Archive. 

While it is possible to set the adder metadata value, it is not recommended. This value is typically set by automated
processes.

collection 

A collection is a specialized item used for curation and aggregation of other items. Assigning an item to a collection 
defines where the item may be located by a user browsing Internet Archive. 

A collection must exist prior to assigning any items to it. Currently collections can only be created by Internet Archive 
staff members. Please contact Internet Archive if you need a collection created. 

All items should belong to a collection. If a collection is not specified at the time of upload, the item will be added to the
opensource collection. For testing purposes, you may upload to the test_collection collection.

contributor 

The value of the contributor metadata field is information about the entity responsible for making contributions 
to the content of the item. This is often the library, organization or individual making the item available on Internet 
Archive. 

The value of this metadata field may contain HTML. <script> tags and CSS are not allowed. 

coverage 

The extent or scope of the content of the material available in the item. The value of the coverage metadata field 
may include geographic place, temporal period, jurisdiction, etc. For items which contain multi-volume or serial 
content, place the statement of holdings in this metadata field. 

creator 

An entity primarily responsible for creating the files contained in the item. 
credits

The participants in the production of the materials contained in the item. 

The value of this metadata field may contain HTML. <script> tags and CSS are not allowed. 

date 

The publication, production or other similar date of this item. 

Please use an ISO 8601 compatible format for this date. 

description 

A description of the item. 

The value of this metadata field may contain HTML. <script> tags and CSS are not allowed. 

language 

The primary language of the material available in the item. 

While the value of the language metadata field can be any value, Internet Archive prefers they be MARC21
Language Codes.

licenseurl 

A URL to the license which covers the works contained in the item. 

Internet Archive recommends (but does not require) Creative Commons licensing. Creative Commons provides a 
license selector for finding the correct license for your needs. 

mediatype 

The primary type of media contained in the item. While an item can contain files of diverse mediatypes the value 
in this field defines the appearance and functionality of the item’s detail page on Internet Archive. In particular, the 
mediatype of an item defines what sort of online viewer is available for the files contained in the item. 

The mediatype metadata field recognizes a limited set of values: 

• audio: The majority of audio items should receive this mediatype value. Items for the Live Music Archive 
should instead use the etree value. 

• collection: Denotes the item as a collection to which other collections and items can belong. 

• data: This is the default value for mediatype. Items with a mediatype of data will be available in Internet 
Archive but you will not be able to browse to them. In addition there will be no online reader/player for the files. 

• etree: Items which contain files for the Live Music Archive should have a mediatype value of etree. The 
Live Music Archive has very specific upload requirements. Please consult the documentation for the Live Music 
Archive prior to creating items for it. 

• image: Items which predominantly consist of image files should receive a mediatype value of image. Currently 
these items will not be available for browsing or online viewing in Internet Archive but they will require 
no additional changes when this mediatype receives additional support in the Archive. 

• movies: All videos (television, features, shorts, etc.) should receive a mediatype value of movies. These 
items will be displayed with an online video player. 

• software: Items with a mediatype of software are accessible to browse via Internet Archive’s software 
collection. There is no online viewer for software but all files are available for download. 

• texts: All text items (PDFs, EPUBs, etc.) should receive a mediatype value of texts. 

• web: The web mediatype value is reserved for items which contain web archive WARC files. 

If the mediatype value you set is not in the list above it will be saved but ignored by the system. The item will be 
treated as though it has a mediatype value of data. 

If a value is not specified for this field it will default to data. 
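
The fallback rule described above can be sketched in a few lines. This is only an illustration of the documented behaviour, not code from the internetarchive library; the names RECOGNIZED_MEDIATYPES and effective_mediatype are hypothetical.

```python
# Values listed in this section; anything else is treated as "data".
RECOGNIZED_MEDIATYPES = {
    "audio", "collection", "data", "etree", "image",
    "movies", "software", "texts", "web",
}


def effective_mediatype(value=None):
    """Return the mediatype the system would act on (hypothetical helper)."""
    if value in RECOGNIZED_MEDIATYPES:
        return value
    # Unrecognized or missing values are saved as-is but handled like "data".
    return "data"
```

A check like this may be useful before an upload, since an unrecognized value is accepted without complaint and only shows up later as a missing viewer.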

noindex 

All items will have their metadata included in the Internet Archive search engine. To disable indexing in the search 
engine, include a noindex metadata tag. The value of the tag does not matter. Its presence is enough to trigger not 
including the metadata in the search engine. 

If an item’s metadata has already been indexed in the search engine, setting noindex will remove it from the index. 

Items whose metadata is not included in the search engine index are not considered “public” per se and therefore will 
not have a value in the publicdate metadata field (see below). 

notes 

Contains user-defined information about the item. 

The value of this metadata field may contain HTML. <script> tags and CSS are not allowed. 

pick 

On the v1 archive.org site, each collection page on Internet Archive may include a “Staff Picks” section. This section 
will highlight a single item in the collection. This item will be selected at random from the items with a pick 
metadata value of 1. If there are no items with this pick metadata value the “Staff Picks” section will not appear on 
the collection page. 

By default all new items have no pick metadata value. Note: v2 of the archive.org website does not make use of this 
value. 

publicdate 

Items which have had their metadata included in the Internet Archive search engine index are considered to be public. 
The date the metadata is added to the index is the public date for the item. 

Please use an ISO 8601 compatible format for this date. For instance, these are all valid date formats: 

• YYYY 

• YYYY-MM-DD 

• YYYY-MM-DD HH:MM:SS 
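
A date string can be checked against these three formats before it is written to an item. The following is a minimal sketch using only the Python standard library; parse_archive_date and ACCEPTED_FORMATS are hypothetical names and are not part of internetarchive.

```python
from datetime import datetime

# strptime patterns for YYYY, YYYY-MM-DD and YYYY-MM-DD HH:MM:SS.
ACCEPTED_FORMATS = ("%Y", "%Y-%m-%d", "%Y-%m-%d %H:%M:%S")


def parse_archive_date(value):
    """Parse value with the first matching format, or raise ValueError."""
    for fmt in ACCEPTED_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError("unsupported date format: %r" % value)
```

ISO 8601 allows more forms than these three, so this sketch is deliberately stricter than the standard.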

While it is possible to set the publicdate metadata value it is not recommended. This value is typically set by 
automated processes. 

publisher 

The publisher of the material available in the item. 

rights 

A statement of the rights held in and over the files in the item. 

The value of this metadata field may contain HTML. <script> tags and CSS are not allowed. 

subject 

Keyword(s) or phrase(s) that may be searched for to find your item. This field can contain multiple values: 

$ ia metadata <identifier> --modify='subject:foo' --modify='subject:bar' 

Or, in Python: 

>>> from internetarchive import modify_metadata 
>>> md = dict(subject=['foo', 'bar']) 
>>> r = modify_metadata('<identifier>', md) 

It is helpful but not necessary for you to use Library of Congress Subject Headings for the value of this metadata 
field. 

title 

The title for the item. This appears in the header of the item’s detail page on Internet Archive. 

If a value is not specified for this field it will default to the identifier for the item. 

updatedate 

The date on which an update was made to the item. This field is repeatable. 

Please use an ISO 8601 compatible format for this date. 

While it is possible to set the updatedate metadata value it is not recommended. This value is typically set by 
automated processes. 

updater 

The name of the account which updated the item. This field is repeatable. 

While it is possible to set the updater metadata value it is not recommended. This value is typically set by automated 
processes. 

uploader 

The name of the account which uploaded the file(s) to the item. 

The uploader has ownership over the item and is allowed to maintain it. 

This value is set by automated processes. 


Custom Metadata Fields 

Internet Archive strives to be metadata agnostic, enabling users to define the metadata format which best suits the 
needs of their material. In addition to the standard metadata fields listed above you may also define as many custom 
metadata fields as you require. These metadata fields can be defined ad hoc at item creation or metadata editing time 
and do not have to be defined in advance. 


Developer Interface 

Configuration 

Certain functions of the internetarchive library require your archive.org credentials (i.e. uploading, modifying 
metadata, searching). Your credentials and other configurations can be provided via a dictionary when instantiating an 
ArchiveSession or Item object, or in a config file. 

The easiest way to create a config file is with the configure function: 

>>> from internetarchive import configure 
>>> configure('user@example.com', 'password') 

Config files are stored in either $HOME/.ia or $HOME/.config/ia.ini by default. You can also specify your 
own path: 

>>> from internetarchive import configure 
>>> configure('user@example.com', 'password', config_file='/home/jake/.config/ia-alternat 

Custom config files can be specified when instantiating an ArchiveSession object: 

>>> from internetarchive import get_session 
>>> s = get_session(config_file='/home/jake/.config/ia-alternate.ini') 

Or an Item object: 

>>> from internetarchive import get_item 
>>> item = get_item('nasa', config_file='/home/jake/.config/ia-alternate.ini') 


IA-S3 Configuration 

Your IA-S3 keys are required for uploading and modifying metadata. You can retrieve your IA-S3 keys at 
https://archive.org/account/s3.php. 

They can be specified in your config file like so: 

[s3] 
access = mYaccEsSkEY 
secret = mYs3cREtKEy 

Or, using the ArchiveSession object: 

>>> from internetarchive import get_session 
>>> c = {'s3': {'access': 'mYaccEsSkEY', 'secret': 'mYs3cREtKEy'}} 
>>> s = get_session(config=c) 
>>> s.access_key 
'mYaccEsSkEY' 


Cookie Configuration 

Your archive.org logged-in cookies are required for downloading access-restricted files that you have permissions to 
and retrieving information about archive.org catalog tasks. 

Your cookies can be specified like so: 

[cookies] 
logged-in-user = user%40example.com 
logged-in-sig = <redacted> 

Or, using the ArchiveSession object: 

>>> from internetarchive import get_session 
>>> c = {'cookies': {'logged-in-user': 'user%40example.com', 'logged-in-sig': 'foo'}} 
>>> s = get_session(config=c) 
>>> s.cookies['logged-in-user'] 
'user%40example.com' 


Logging Configuration 

You can specify logging levels and the location of your log file like so: 

[logging] 
level = INFO 
file = /tmp/ia.log 

Or, using the ArchiveSession object: 

>>> from internetarchive import get_session 
>>> c = {'logging': {'level': 'INFO', 'file': '/tmp/ia.log'}} 
>>> s = get_session(config=c) 

By default logging is turned off. 

Other Configuration 

By default all requests are HTTPS in Python versions 2.7.10 or newer. You can change this setting in your config file 
in the general section: 

[general] 
secure = False 

Or, using the ArchiveSession object: 

>>> from internetarchive import get_session 
>>> s = get_session(config={'general': {'secure': False}}) 

In the example above, all requests will be made via HTTP. 
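
The config file sections shown in this chapter map one-to-one onto the nested dictionary accepted by get_session(config=...). The sketch below reads such a file into a dictionary with the standard library only. It illustrates the file layout and is not how the library itself loads its config; ini_to_config and INI_TEXT are hypothetical names. RawConfigParser is used because the % in the cookie value would otherwise be treated as interpolation syntax.

```python
import configparser

# Sample file content assembled from the sections shown above.
INI_TEXT = """
[s3]
access = mYaccEsSkEY
secret = mYs3cREtKEy

[cookies]
logged-in-user = user%40example.com
logged-in-sig = foo

[logging]
level = INFO
file = /tmp/ia.log

[general]
secure = False
"""


def ini_to_config(text):
    """Turn INI text into a {section: {key: value}} dictionary."""
    parser = configparser.RawConfigParser()  # no '%' interpolation
    parser.read_string(text)
    return {section: dict(parser.items(section)) for section in parser.sections()}


config = ini_to_config(INI_TEXT)
```

Note that every value read this way is a string (for example 'False' rather than False), so boolean settings may need converting before the dictionary is used elsewhere.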


ArchiveSession Objects 

The ArchiveSession object is subclassed from requests.Session. It collects together your credentials and config. 

get_session(config=None, config_file=None, debug=None, http_adapter_kwargs=None) 

Return a new ArchiveSession object. The ArchiveSession object is the main interface to the 
internetarchive lib. It allows you to persist certain parameters across tasks. 

Parameters 

• config (dict) - (optional) A dictionary used to configure your session. 

• config_file (str) - (optional) A path to a config file used to configure your session. 

• http_adapter_kwargs (dict) - (optional) Keyword arguments that requests.adapters.HTTPAdapter takes. 

Returns ArchiveSession object. 

Usage: 

>>> from internetarchive import get_session 
>>> config = dict(s3=dict(access='foo', secret='bar')) 
>>> s = get_session(config) 
>>> s.access_key 
'foo' 

From the session object, you can access all of the functionality of the internetarchive lib: 

>>> item = s.get_item('nasa') 
>>> item.download() 
nasa: ddddddd - success 
>>> s.get_tasks(task_ids=31643513)[0].server 
'ia311234' 


Item Objects 

Item objects represent Internet Archive items. From the Item object you can create new items, upload files to 
existing items, read and write metadata, and download or delete files. 

get_item(identifier, config=None, config_file=None, archive_session=None, debug=None, 
http_adapter_kwargs=None, request_kwargs=None) 

Get an Item object. 

Parameters 

• identifier (str) - The globally unique Archive.org item identifier. 

• config (dict) - (optional) A dictionary used to configure your session. 

• config_file (str) - (optional) A path to a config file used to configure your session. 

• archive_session (ArchiveSession) - (optional) An ArchiveSession object can be 
provided via the archive_session parameter. 

• http_adapter_kwargs (dict) - (optional) Keyword arguments that requests.adapters.HTTPAdapter takes. 

• request_kwargs (dict) - (optional) Keyword arguments that requests.Request takes. 


Usage: 

>>> from internetarchive import get_item 
>>> item = get_item('nasa') 
>>> item.item_size 
121084 


Uploading 

Uploading to an item can be done using Item.upload(): 

>>> item = get_item('my_item') 
>>> r = item.upload('/home/user/foo.txt') 

Or internetarchive.upload(): 

>>> from internetarchive import upload 
>>> r = upload('my_item', '/home/user/foo.txt') 

The item will automatically be created if it does not exist. 

Refer to archive.org Identifiers for more information on creating valid archive.org identifiers. 


Setting Remote Filenames 

Remote filenames can be defined using a dictionary: 

>>> from io import BytesIO 
>>> fh = BytesIO() 
>>> fh.write(b'foo bar') 
>>> item.upload({'my-remote-filename.txt': fh}) 

upload ( identifier, files, metadata=None, headers=None, access_key=None, secret_key=None, 
queue_derive=None, verbose=None, verify=None, checksum=None, delete=None, retries=None, 
retries_sleep=None, debug=None, request_kwargs=None, **get_item_kwargs) 

Upload files to an item. The item will be created if it does not exist. 

Parameters 

• identifier (str) - The globally unique Archive.org identifier for a given item. 

• files - The filepaths or file-like objects to upload. This value can be an iterable or a single 
file-like object or string. 

• metadata (dict) - (optional) Metadata used to create a new item. If the item already exists, 
the metadata will not be updated - use modify_metadata. 

• headers (dict) - (optional) Add additional HTTP headers to the request. 

• access_key (str) - (optional) IA-S3 access_key to use when making the given request. 

• secret_key (str) - (optional) IA-S3 secret_key to use when making the given request. 

• queue_derive (bool) - (optional) Set to False to prevent an item from being derived after 
upload. 

• verbose (bool) - (optional) Display upload progress. 

• verify (bool) - (optional) Verify local MD5 checksum matches the MD5 checksum of the 
file received by IAS3. 

• checksum (bool) - (optional) Skip uploading files based on checksum. 

• delete (bool) - (optional) Delete local file after the upload has been successfully verified. 

• retries (int) - (optional) Number of times to retry the given request if S3 returns a 503 
SlowDown error. 

• retries_sleep (int) - (optional) Amount of time to sleep between retries. 

• debug (bool) - (optional) Set to True to print headers to stdout, and exit without sending the 
upload request. 

• **kwargs - Optional arguments that get_item takes. 

Returns A list of requests.Response objects. 
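
The retries and retries_sleep parameters describe a retry loop around 503 SlowDown responses. The sketch below shows that general pattern only; it is not the library's implementation. send_with_retries and its send callable (assumed to return an HTTP status code) are hypothetical names.

```python
import time


def send_with_retries(send, retries, retries_sleep, sleep=time.sleep):
    """Call send() until it returns a non-503 status or retries run out.

    send is a hypothetical zero-argument callable returning a status code.
    sleep is injectable so the loop can be exercised without waiting.
    """
    attempts_left = retries
    while True:
        status = send()
        if status != 503 or attempts_left <= 0:
            return status
        attempts_left -= 1
        sleep(retries_sleep)  # back off before the next attempt
```

With retries=2 the request is sent at most three times: the initial attempt plus two retries.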


Metadata 

modify_metadata(identifier, metadata, target=None, append=None, append_list=None, priority=None, 
access_key=None, secret_key=None, debug=None, request_kwargs=None, **get_item_kwargs) 

Modify the metadata of an existing item on Archive.org. 

Parameters 

• identifier (str) - The globally unique Archive.org identifier for a given item. 

• metadata (dict) - Metadata used to update the item. 

• target (str) - (optional) The metadata target to update. Defaults to metadata. 

• append (bool) - (optional) set to True to append metadata values to current values rather 
than replacing. Defaults to False. 

• append_list (bool) - (optional) Append values to an existing multi-value metadata field. No 
duplicate values will be added. 

• priority (int) - (optional) Set task priority. 

• access_key (str) - (optional) IA-S3 access_key to use when making the given request. 

• secret_key (str) - (optional) IA-S3 secret_key to use when making the given request. 

• debug (bool) - (optional) set to True to return a requests.Request object instead of 
sending request. Defaults to False. 

• **get_item_kwargs - (optional) Arguments that get_item takes. 

Returns requests.Response object or requests.Request object if debug is True. 
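
The append_list behaviour described above (add values to a multi-value field without creating duplicates) can be illustrated locally. This is a sketch of the documented semantics, not the library's code; append_to_list is a hypothetical helper.

```python
def append_to_list(current, new_values):
    """Return current extended with new_values, skipping duplicates."""
    if current is None:
        merged = []
    elif isinstance(current, str):
        merged = [current]  # a single-value field becomes a one-item list
    else:
        merged = list(current)
    if isinstance(new_values, str):
        new_values = [new_values]
    for value in new_values:
        if value not in merged:
            merged.append(value)
    return merged
```

For contrast, append=True is described as appending to the current value rather than adding a new entry.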

The default target to write to is metadata. If you would like to write to another target, such as files, you can 
specify so using the target parameter. For example, if we had an item whose identifier was my_identifier and 
you wanted to add a metadata field to a file within the item called foo.txt: 

>>> r = modify_metadata('my_identifier', metadata=dict(title='My File'), target='files/foo.txt') 
>>> from internetarchive import get_files 
>>> f = list(get_files('my_identifier', 'foo.txt'))[0] 
>>> f.title 
'My File' 

You can also create new targets if they don’t exist: 

>>> r = modify_metadata('my_identifier', metadata=dict(foo='bar'), target='extra_metadata') 
>>> from internetarchive import get_item 
>>> item = get_item('my_identifier') 
>>> item.item_metadata['extra_metadata'] 
{'foo': 'bar'} 


Downloading 

download(identifier, files=None, formats=None, glob_pattern=None, dry_run=None, verbose=None, 
silent=None, ignore_existing=None, checksum=None, destdir=None, no_directory=None, retries=None, 
item_index=None, ignore_errors=None, on_the_fly=None, return_responses=None, **get_item_kwargs) 

Download files from an item. 

Parameters 

• identifier (str) - The globally unique Archive.org identifier for a given item. 

• files - (optional) Only return files matching the given file names. 

• formats - (optional) Only return files matching the given formats. 

• glob_pattern (str) - (optional) Only return files matching the given glob pattern. 

• dry_run (bool) - (optional) Print URLs to files to stdout rather than downloading them. 

• verbose (bool) - (optional) Turn on verbose output. 

• silent (bool) - (optional) Suppress all output. 

• ignore_existing (bool) - (optional) Skip files that already exist locally. 

• checksum (bool) - (optional) Skip downloading file based on checksum. 

• destdir (str) - (optional) The directory to download files to. 

• no_directory (bool) - (optional) Download files to current working directory rather than 
creating an item directory. 

• retries (int) - (optional) The number of times to retry on failed requests. 

• item_index (int) - (optional) The index of the item for displaying progress in bulk downloads. 

• ignore_errors (bool) - (optional) Don’t fail if a single file fails to download, continue to 
download other files. 

• on_the_fly (bool) - (optional) Download on-the-fly files (i.e. derivative EPUB, MOBI, 
DAISY files). 

• return_responses (bool) - (optional) Rather than downloading files to disk, return a list of 
response objects. 

• **kwargs - Optional arguments that get_item takes. 

Return type bool 

Returns True if all files were downloaded successfully. 


Deleting 

delete(identifier, files=None, formats=None, glob_pattern=None, cascade_delete=None, 
access_key=None, secret_key=None, verbose=None, debug=None, **kwargs) 

Delete files from an item. Note: Some system files, such as <itemname>_meta.xml, cannot be deleted. 

Parameters 

• identifier (str) - The globally unique Archive.org identifier for a given item. 

• files - (optional) Only delete files matching the given filenames. 

• formats - (optional) Only delete files matching the given formats. 

• glob_pattern (str) - (optional) Only delete files matching the given glob pattern. 

• cascade_delete (bool) - (optional) Also deletes files derived from the file, and files the 
file was derived from. 

• access_key (str) - (optional) IA-S3 access_key to use when making the given request. 

• secret_key (str) - (optional) IA-S3 secret_key to use when making the given request. 

• verbose (bool) - Print actions to stdout. 

• debug (bool) - (optional) Set to True to print headers to stdout and exit without sending 
the delete request. 


File Objects 

get_files(identifier, files=None, formats=None, glob_pattern=None, on_the_fly=None, 
**get_item_kwargs) 

Get File objects from an item. 

Parameters 

• identifier (str) - The globally unique Archive.org identifier for a given item. 

• files (iterable) - (optional) Only return files matching the given filenames. 

• formats (iterable) - (optional) Only return files matching the given formats. 

• glob_pattern (str) - (optional) Only return files matching the given glob pattern. 

• on_the_fly (bool) - (optional) Include on-the-fly files (i.e. derivative EPUB, MOBI, DAISY 
files). 

• **get_item_kwargs - (optional) Arguments that get_item() takes. 


Usage: 

>>> from internetarchive import get_files 
>>> fnames = [f.name for f in get_files('nasa', glob_pattern='*xml')] 
>>> print(fnames) 
['nasa_reviews.xml', 'nasa_meta.xml', 'nasa_files.xml'] 
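
To see how a glob pattern such as '*xml' selects file names, the same filtering can be tried locally with the standard library. This sketch assumes shell-style matching as provided by fnmatch; whether the library matches patterns in exactly this way is an assumption, and the .jpg name is a made-up non-matching entry.

```python
from fnmatch import fnmatch

# The three names from the usage example, plus one hypothetical non-match.
names = [
    'nasa_reviews.xml',
    'nasa_meta.xml',
    'nasa_files.xml',
    'example_image.jpg',
]

# Keep only the names that match the shell-style pattern.
xml_names = [name for name in names if fnmatch(name, '*xml')]
```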


Searching Items 

search_items(query, fields=None, sorts=None, params=None, archive_session=None, config=None, 
config_file=None, http_adapter_kwargs=None, request_kwargs=None, max_retries=None) 

Search for items on Archive.org. 

Parameters 

• query (str) - The Archive.org search query to yield results for. Refer to 
https://archive.org/advancedsearch.php#raw for help formatting your query. 

• fields (list) - (optional) The metadata fields to return in the search results. 

• params (dict) - (optional) The URL parameters to send with each request sent to the 
Archive.org Advancedsearch API. 

• secure - (optional) Configuration options for session. 

• config_file (str) - (optional) A path to a config file used to configure your session. 

• http_adapter_kwargs (dict) - (optional) Keyword arguments that requests.adapters.HTTPAdapter takes. 

• request_kwargs (dict) - (optional) Keyword arguments that requests.Request takes. 

• max_retries (int, object) - The number of times to retry a failed request. This can also be 
a urllib3.Retry object. If you need more control (e.g. status_forcelist), use an ArchiveSession 
object, and mount your own adapter after the session object has been initialized. For 
example: 

>>> s = get_session() 
>>> s.mount_http_adapter() 
>>> search_results = s.search_items('nasa') 

See ArchiveSession.mount_http_adapter() for more details. 

Returns A Search object, yielding search results. 


Internet Archive Tasks 

get_tasks(identifier=None, task_ids=None, task_type=None, params=None, config=None, 
config_file=None, verbose=None, archive_session=None, http_adapter_kwargs=None, 
request_kwargs=None) 

Get tasks from the Archive.org catalog. internetarchive must be configured with your logged-in-* cookies 
to use this function. If no arguments are provided, all queued tasks for the user will be returned. 

Parameters 

• identifier (str) - (optional) The Archive.org identifier for which to retrieve tasks. 

• task_ids (int or str) - (optional) The task_ids to retrieve from the Archive.org catalog. 

• task_type (str) - (optional) The type of tasks to retrieve from the Archive.org catalog. The 
types can be either “red” for failed tasks, “blue” for running tasks, “green” for pending tasks, 
“brown” for paused tasks, or “purple” for completed tasks. 

• params (dict) - (optional) The URL parameters to send with each request sent to the 
Archive.org catalog API. 

• secure - (optional) Configuration options for session. 

• verbose (bool) - (optional) Set to True to retrieve verbose information for each catalog 
task returned. verbose is set to True by default. 

Returns A set of CatalogTask objects. 
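
The colour names accepted by task_type are easy to mix up, so a small lookup table can make calling code more readable. The mapping below restates the list given for task_type above; TASK_TYPE_STATES and describe_task_type are hypothetical names, not part of the library.

```python
# Colour names accepted by task_type and the task state each one stands for.
TASK_TYPE_STATES = {
    "red": "failed",
    "blue": "running",
    "green": "pending",
    "brown": "paused",
    "purple": "completed",
}


def describe_task_type(task_type):
    """Return the task state for a colour name, or raise ValueError."""
    try:
        return TASK_TYPE_STATES[task_type]
    except KeyError:
        raise ValueError("unknown task_type: %r" % task_type)
```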


Updates 

Release History 

1.8.0 (2018-06-28) 

Features and Improvements 

• Only use backports.csv for python2 in support of FreeBSD port. 

• Added a nicer error message to ia search for authentication errors. 

• Added support for using netrc files in ia configure. 

• Added --remove option to ia metadata for removing values from single or multi-field metadata elements. 

• Added support for appending a metadata value to an existing metadata element (as a new entry, not simply 
appending to a string). 

• Added --no-change-timestamp flag to ia download. Downloaded files retain the timestamp of “now”, 
not of the source material when this option is used. 

Bugfixes 

• Fixed bug in upload where StringIO objects were not uploadable. 

• Fixed encoding issues that were causing some ia tasks commands to fail. 

• Fixed bug where keep-old-version wasn’t working in ia move. 

• Fixed bug in internetarchive.api.modify_metadata where debug and other args were not honoured. 

1.7.7 (2018-03-05) 

Features and Improvements 

• Added support for downloading on-the-fly archive_marc.xml files. 

Bugfixes 

• Improved syntax checking in ia move and ia copy. 

• Added Connection:close header to all requests to force close connections after each request. This is a 
workaround for dealing with a bug on archive.org servers where the server hangs up before sending the complete 
response. 

1.7.6 (2018-01-05) 

Features and Improvements 

• Added ability to set the remote-name for a directory in ia upload (previously you could only do this for 
single files). 

Bugfixes 

• Fixed bug in ia delete where all requests were failing due to a typo in a function arg. 


1.7.5 (2017-12-07) 

Features and Improvements 

• Turned on x-archive-keep-old-version S3 header by default for all ia upload, ia delete, ia 
copy, and ia move commands. This means that any ia command that clobbers or deletes a file will 
save a version of the file in <identifier>/history/files/$key.~N~. This is only on by default in 
the CLI, and not in the Python lib. It can be turned off by adding -H x-archive-keep-old-version:0 
to any ia upload, ia delete, ia copy, or ia move command. 

1.7.4 (2017-11-06) 

Features and Improvements 

• Increased timeout in search from 12 seconds to 24. 

• Added ability to set the max_retries in internetarchive.search_items(). 

• Made internetarchive.ArchiveSession.mount_http_adapter() a public method for supporting 
complex custom retry logic. 

• Added --timeout option to ia search for setting a custom timeout. 

• Loosened requirements for schema library to schema>=0.4.0. 

Bugfixes 

• The scraping API has reverted to using items key rather than docs key. v1.7.3 will still work, but this change 
keeps ia consistent with the API. 

1.7.3 (2017-09-20) 

Bugfixes 

• Fixed bug in search where search requests were failing with KeyError: 'items'. 

1.7.2 (2017-09-11) 

Features and Improvements 

• Added support for adding custom headers to ia search. 

Bugfixes 

• internetarchive.utils.get_s3_xml_text() is used to parse errors returned by S3 in XML. Sometimes 
there is no XML in the response. Most of the time this is due to 5xx errors. Either way, we want to always 
return the HTTPError, even if the XML parsing fails. 

• Fixed a regression where : was being stripped from filenames in upload. 

• Do not create a directory in download() when return_responses is True. 

• Fixed bug in upload where file-like objects were failing with a TypeError exception. 

1.7.1 (2017-07-25) 

Bugfixes 

• Fixed bug in Item.upload_file() where checksum was being set to True if it was set to None. 

1.7.1 (2017-07-25) 

Bugfixes 

• Fixed bug in ia upload where all commands would fail if multiple collections were specified (e.g. -m 
collection:foo -m collection:bar). 


1.7.0 (2017-07-25) 

Features and Improvements 

• Loosened up jsonpatch requirements, as the metadata API now supports more recent versions of the JSON 
Patch standard. 

• Added support for building “snap” packages (https://snapcraft.io/). 

Bugfixes 

• Fixed bug in upload where users were unable to add their own timeout via request_kwargs. 

• Fixed bug where files with non-ascii filenames failed to upload on some platforms. 

• Fixed bug in upload where metadata keys with an index (e.g. subject[0]) would make the request fail if the 
key was the only indexed key provided. 

• Added a default timeout to ArchiveSession.s3_is_overloaded(). If it times out now, it returns 
True (as in, yes, S3 is overloaded). 


1.6.0 (2017-06-27) 

Features and Improvements 

• Added 60 second timeout to all upload requests. 

• Added support for uploading empty files. 

• Refactored Item.get_files() to be faster, especially for items with many files. 

• Updated search to use IA-S3 keys for auth instead of cookies. 

Bugfixes 

• Fixed bug in upload where derives weren’t being queued in some cases where checksum=True was set. 

• Fixed bug where ia tasks and other Catalog functions were always using HTTP even when it should have 
been HTTPS. 

• ia metadata was exiting with a non-zero status for “no changes to xml” errors. This now exits with 0, as 
nearly every time this happens it should not be considered an “error”. 

• Added Unicode support to ia upload --spreadsheet and ia metadata --spreadsheet using 
the backports.csv module. 

• Fixed bug in ia upload --spreadsheet where some metadata was accidentally being copied from previous 
rows (e.g. when multiple subjects were used). 

• Submitter wasn’t being added to ia tasks --json output; it now is. 

• row_type in ia tasks --json was returning integer for row-type rather than name (e.g. ‘red’). 


1.5.0 (2017-02-17) 

Features and Improvements 

• Added option to download() for returning a list of response objects rather than writing files to disk. 

1.4.0 (2017-01-26) 

Bugfixes 

• Another bugfix for setting mtime correctly after fileobj functionality was added to ia download. 

1.3.0 (2017-01-26) 

Bugfixes 

• Fixed bug where download was trying to set mtime, even when fileobj was set to True (e.g. ia 
download <id> <file> --stdout). 

1.2.0 (2017-01-26) 

Features and Improvements 

• Added ia copy and ia move for copying and moving files in archive.org items. 

• Added support for outputting JSON in ia tasks. 

• Added support to ia download to write to stdout instead of file. 

Bugfixes 

• Fixed bug in upload where AttributeError was raised when trying to upload file-like objects without a name 
attribute. 

• Removed identifier validation from ia delete. If an identifier already exists, we don’t need to validate it. 
This only makes things annoying if an identifier exists but fails internetarchive id validation. 

• Fixed bug where error message isn’t returned in ia upload if the response body is not XML. Ideally IA-S3 
would always return XML, but that’s not the case as of now. Try to dump the HTML in the S3 response if unable 
to parse XML. 

• Fixed bug where ArchiveSession headers weren’t being sent in prepared requests. 

• Fixed bug in ia upload --size-hint where value was an integer, but requests requires it to be a string. 

• Added support for downloading files to stdout in ia download and File.download. 

1.1.0 (2016-11-18) 

Features and Improvements 

• Make sure collection exists when creating new item via ia upload. If it doesn’t, upload will fail. 

• Refactored tests. 

Bugfixes 

• Fixed bug where the full filepath was being set as the remote filename in Windows. 

• Convert all metadata header values to strings for compatibility with requests>=2.11.0. 

1.0.10(2016-09-20) 

Bugfixes 

• Convert x-archive-cascade-delete headers to strings for compatibility with requests>=2.11.0. 


1.0.9 (2016-08-16) 

Features and Improvements 

• Added support to the CLI for providing username and password as options on the command-line. 


1.0.8 (2016-08-10) 

Features and Improvements 

• Increased maximum identifier length from 80 to 100 characters in ia upload. 

Bugfixes 

• As of version 2.11.0 of the requests library, all header values must be strings (i.e. not integers). 
internetarchive now converts all header values to strings. 

1.0.7 (2016-08-02) 

Features and Improvements 

• Added internetarchive.api.get_user_info(). 

1.0.6 (2016-07-14) 

Bugfixes 

• Fixed bug where upload was failing on file-like objects (e.g. StringIO objects). 


1.0.5 (2016-07-07) 

Features and Improvements 

• All metadata writes are now submitted at -5 priority by default. This is friendlier to the archive.org catalog, and 
should only be changed for one-off metadata writes. 

• Expanded scope of valid identifiers in utils.validate_ia_identifier (i.e. ia upload). Periods 
are now allowed. Periods, underscores, and dashes are not allowed as the first character. 


1.0.4 (2016-06-28) 

Features and Improvements 

• Search now uses the v1 scraping API endpoint. 

• Moved internetarchive.item.Item.upload.iter_directory() to internetarchive.utils. 

• Added support for downloading “on-the-fly” files (e.g. EPUB, MOBI, and DAISY) via ia download <id> 
--on-the-fly or item.download(on_the_fly=True). 

Bugfixes 

• s3_is_overloaded() now returns True if the call is unsuccessful. 

• Fixed bug in upload where a derive task wasn’t being queued when a directory is uploaded. 

1.0.3 (2016-05-16) 

Features and Improvements 

• Use scrape API for getting total number of results rather than the advanced search API. 

• Improved error messages for IA-S3 (upload) related errors. 

• Added retry support to delete. 

• ia delete no longer exits if a single request fails when deleting multiple files, but continues onto the next 
file. If any file fails, the command will exit with a non-zero status code. 

• All search requests now require authentication via IA-S3 keys. You can run ia configure to generate a 
config file that will be used to authenticate all search requests automatically. For more details refer to the 
following links: 

http://internetarchive.readthedocs.io/en/latest/quickstart.html?highlight=configure#configuring

http://internetarchive.readthedocs.io/en/latest/api.html#configuration 

• Added ability to specify your own filepath in ia configure and internetarchive.configure().
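
The ia delete behavior described in the list above (continue past a failed request, then exit with a non-zero status if anything failed) can be sketched as follows. The names delete_all and delete_one are hypothetical, and the sketch makes no claim about how the CLI is actually structured.

```python
def delete_all(filenames, delete_one):
    """Try to delete every file and return a process exit status.

    ``delete_one`` is a hypothetical callable that raises on failure.
    """
    failed = []
    for name in filenames:
        try:
            delete_one(name)
        except Exception as exc:
            # Report the failure, remember it, and move on to the next file.
            print("error deleting {}: {}".format(name, exc))
            failed.append(name)
    return 1 if failed else 0
```

A caller would typically pass the return value to sys.exit().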

Bugfixes 

• Updated requests lib version requirements. This resolves issues with sending binary strings as bodies in 
Python 3. 

• Improved support for Windows, see https://github.com/jjjake/internetarchive/issues/126 for more details. 

• Previously all requests were made in HTTP for Python versions < 2.7.9 due to the issues
described at https://urllib3.readthedocs.org/en/latest/security.html. In favor of security over
convenience, all requests are now made via HTTPS regardless of Python version. Refer to
http://internetarchive.readthedocs.org/en/latest/troubleshooting.html#https-issues if you are experiencing issues.

• Fixed bug in ia CLI where --insecure was still making HTTPS requests when it should have been making
HTTP requests.

• Fixed bug in ia delete where --all option wasn’t working because it was using item.iter_files
instead of item.get_files.

• Fixed bug in ia upload where uploading files with Unicode file names was failing.

• Fixed bug in upload where filenames with ; characters were being truncated. 

• Fixed bug in internetarchive.catalog where TypeError was being raised in Python 3 due to mixing
bytes with strings. 


1.0.2 (2016-03-07) 

Bugfixes 

• Fixed OverflowError bug in uploads on 32-bit systems when uploading files larger than ~2GB. 

• Fixed Unicode bug in upload where urllib.parse.quote is unable to parse non-encoded strings.

Features and Improvements 

• Only generate MD5s in upload if they are used (i.e. verify, delete, or checksum is True). 

• verify is off by default in ia upload; it can be turned on with ia upload --verify.
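
The idea of generating MD5s only when they are needed can be sketched as below. This is a minimal illustration under assumptions: maybe_md5 and md5_hexdigest are hypothetical names, and the three keyword arguments simply mirror the option names mentioned in the list above.

```python
import hashlib
import io


def md5_hexdigest(fileobj, chunk_size=1024 * 1024):
    """Compute the MD5 of a binary file-like object in chunks, then rewind it."""
    digest = hashlib.md5()
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        digest.update(chunk)
    fileobj.seek(0)
    return digest.hexdigest()


def maybe_md5(fileobj, verify=False, delete=False, checksum=False):
    """Return the MD5 only when one of the options that needs it is set."""
    if verify or delete or checksum:
        return md5_hexdigest(fileobj)
    return None  # skip the expensive read entirely


body = io.BytesIO(b"hello")
print(maybe_md5(body))               # no option set, so nothing is hashed
print(maybe_md5(body, verify=True))  # hashed because verify needs it
```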




1.0.1 (2016-03-04) 

Bugfixes 

• Fixed memory leak in ia upload --spreadsheet=metadata.csv.

• Fixed arg parsing bug in ia CLI. 

1.0.0 (2016-03-01) 

Features and Improvements 

• Renamed internetarchive.iacli to internetarchive.cli. 

• Moved File object to internetarchive.files.

• Converted config format from YAML to INI to avoid PyYAML requirement.

• Use HTTPS by default for Python versions > 2.7.9. 

• Added get_username function to API. 

• Improved Python 3 support. internetarchive is now being tested against Python versions 2.6, 2.7, 3.4,
and 3.5.

• Improved plugin support. 

• Added retry support to download and metadata retrieval. 

• Added Collection object. 

• Made Item objects hashable and orderable. 
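
What "hashable and orderable" means in practice can be sketched with a stand-in class. SketchItem is hypothetical, and comparing by identifier is an assumption for illustration; the notes above do not say which attribute the real Item class uses.

```python
from functools import total_ordering


@total_ordering
class SketchItem:
    """Hypothetical stand-in for an item, compared and hashed by identifier."""

    def __init__(self, identifier):
        self.identifier = identifier

    def __eq__(self, other):
        if not isinstance(other, SketchItem):
            return NotImplemented
        return self.identifier == other.identifier

    def __lt__(self, other):
        if not isinstance(other, SketchItem):
            return NotImplemented
        return self.identifier < other.identifier

    def __hash__(self):
        # Defining __eq__ disables the default hash, so provide one explicitly.
        return hash(self.identifier)


items = [SketchItem("b"), SketchItem("a"), SketchItem("b")]
print([i.identifier for i in sorted(set(items))])
```

Being hashable lets items be stored in sets and used as dict keys; being orderable lets them be passed to sorted().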

Bugfixes 

• IA’s Advanced Search API no longer supports deep-paging of large result sets. All search functions have been
refactored to use the new Scrape API (http://archive.org/help/aboutsearch.htm). Search functions in previous
versions are effectively broken; upgrade to >=1.0.0.

0.9.8 (2015-11-09) 

Bugfixes 

• Fixed ia help bug. 

• Fixed bug in File.download() where connection errors weren’t being caught/retried correctly.

0.9.7 (2015-11-05) 

Bugfixes 

• Clean up partially downloaded files when download() fails.
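
Removing a partial file after a failed download can be sketched as follows. This is a generic pattern under stated assumptions, not the code in download(): write_or_clean_up and failing_chunks are hypothetical names, and the failure is simulated rather than coming from a real network request.

```python
import os
import tempfile


def write_or_clean_up(path, chunks):
    """Write ``chunks`` to ``path``; remove the partial file on any failure."""
    try:
        with open(path, "wb") as fh:
            for chunk in chunks:
                fh.write(chunk)
    except BaseException:
        # The file is closed by now; delete whatever was written, then re-raise.
        if os.path.exists(path):
            os.remove(path)
        raise


def failing_chunks():
    """Simulate a connection that drops after the first chunk."""
    yield b"partial data"
    raise IOError("connection dropped")


target = os.path.join(tempfile.mkdtemp(), "file.bin")
try:
    write_or_clean_up(target, failing_chunks())
except IOError:
    pass
print(os.path.exists(target))  # the partial file has been removed
```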

Features and Improvements 

• Added --format option to ia delete.

• Refactored download() and ia download to behave more like rsync. Files are now clobbered by default;
ignore_existing and --ignore-existing now skip over files already downloaded without making a request.

• Added retry support to download() and ia download. 

• Added files kwarg to Item.download() for downloading specific files. 

• Added ignore_errors option to File.download() for ignoring (but logging) exceptions.

• Added default timeouts to metadata and download requests. 

• Less verbose output in ia download by default. Use ia download --verbose for old style output.

0.9.6 (2015-10-12) 

Bugfixes 

• Removed sync-db features for now, as lazytable is not playing nicely with setup.py right now.

0.9.5 (2015-10-12) 

Features and Improvements 

• Added skip based on mtime and length if no other clobber/skip options specified in download() and ia download. 
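
A skip decision based on mtime and length can be sketched like this. The function should_skip and its parameters are hypothetical, and the exact comparison the library performs is not stated in these notes, so treat the equality checks as an assumption.

```python
import os
import tempfile


def should_skip(path, expected_size, expected_mtime):
    """Return True if a local file already matches the remote size and mtime."""
    try:
        st = os.stat(path)
    except OSError:
        return False  # no local copy, so the file must be downloaded
    return st.st_size == expected_size and int(st.st_mtime) == int(expected_mtime)


# Create a 5-byte local file and give it a known modification time.
with tempfile.NamedTemporaryFile(delete=False) as fh:
    fh.write(b"12345")
local_path = fh.name
os.utime(local_path, (1444608000, 1444608000))
print(should_skip(local_path, 5, 1444608000))
```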

0.9.4 (2015-10-01) 

Features and Improvements 

• Added internetarchive.api.get_username() for retrieving a username with an S3 key-pair. 

• Added ability to sync downloads via an sqlite database. 

0.9.3 (2015-09-28) 

Features and Improvements 

• Added ability to download items from an itemlist or search query in ia download. 

• Made ia configure Python 3 compatible.

Bugfixes 

• Fixed bug in ia upload where uploading an item with more than one collection specified caused the collection 
check to fail. 

0.9.2 (2015-08-17) 

Bugfixes 

• Added error message for failed ia configure calls due to invalid creds. 

0.9.1 (2015-08-13) 

Bugfixes 

• Updated docopt to v0.6.2 and PyYAML to v3.11.

• Updated setup.py to automatically pull version from __init__.

0.8.5 (2015-07-13)

Bugfixes 

• Fixed UnicodeEncodeError in ia metadata --append.

Features and Improvements 

• Added configuration documentation to readme. 

• Updated requests to v2.7.0 

0.8.4 (2015-06-18) 

Features and Improvements 

• Added check to ia upload to see if the collection being uploaded to exists. Also added an option to override this 
check. 

0.8.3 (2015-05-18) 

Features and Improvements 

• Fixed append to work like a standard metadata update if the metadata field does not yet exist for the given item. 

0.8.0 (2015-03-09)

Bugfixes

• Encode filenames in upload URLs. 

0.7.9 (2015-01-26) 

Bugfixes 

• Fixed bug in internetarchive.config.get_auth_config (i.e. ia configure) where logged-in cookies returned expired
within hours. Cookies should now be valid for about one year. 

0.7.8 (2014-12-23) 

• Output error message when downloading non-existing files in ia download rather than raising Python exception. 

• Fixed IOError in ia search when using head, tail, etc.

• Simplified ia search to output only JSON, rather than doing any special formatting. 

• Added experimental support for creating pex binaries of ia in Makefile. 

0.7.7 (2014-12-17) 

• Simplified ia configure. It now only asks for Archive.org email/password and automatically adds S3 keys and 
Archive.org cookies to config. See internetarchive.config.get_auth_config().

0.7.6 (2014-12-17)

• Write metadata to stdout rather than stderr in ia mine. 

• Added options to search archive.org/v2. 

• Added destdir option to download files/itemdirs to a given destination dir. 

0.7.5 (2014-10-08) 

• Fixed typo. 

0.7.4 (2014-10-08) 

• Fixed missing “import” typo in internetarchive.iacli.ia_upload. 

0.7.3 (2014-10-08) 

• Added progress bar to ia mine. 

• Fixed Unicode metadata support for upload(). 

0.7.2 (2014-09-16) 

• Suppress KeyboardInterrupt exceptions and exit with status code 130.

• Added ability to skip downloading files based on checksum in ia download, Item.download(), and 
File.download(). 

• ia download is now verbose by default. Output can be suppressed with the --quiet flag.

• Added an option to not download into item directories, but rather the current working directory (i.e. ia download
--no-directories <id>).

• Added/fixed support for modifying different metadata targets (i.e. files/logo.jpg). 
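
The KeyboardInterrupt handling listed above can be sketched as a thin wrapper around a program's entry point. The name run is hypothetical and this is only a generic pattern; 130 is the conventional shell status for a process ended by Ctrl-C (128 plus signal number 2).

```python
def run(main):
    """Run ``main`` and translate Ctrl-C into exit status 130.

    ``main`` is a hypothetical entry point that returns an int status or None.
    """
    try:
        return main() or 0
    except KeyboardInterrupt:
        # Suppress the traceback and report the conventional status instead.
        return 130
```

A script would typically finish with sys.exit(run(main)).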

0.7.1 (2014-08-25) 

• Added Item.s3_is_overloaded() method for S3 status check. This method is now also used on retries in the upload
method, which avoids uploading any data if a 503 is expected. If a 503 is still returned, retries are
attempted.

• Added --status-check option to ia upload for S3 status check.

• Added --source parameter to ia list for returning files matching IA source (i.e. original, derivative, metadata,
etc.). 

• Added support to ia upload for setting remote-name if only a single file is being uploaded. 

• Derive tasks are now only queued after the last file has been uploaded. 

• File URLs are now quoted in File objects, for downloading files with special characters in their filenames.
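
Quoting a file name before placing it in a URL can be sketched with the standard library. The helper file_url is hypothetical and the URL layout is an assumption for illustration; the sketch uses Python 3's urllib.parse.quote.

```python
from urllib.parse import quote


def file_url(identifier, filename):
    """Build a download URL with the file name percent-encoded.

    The URL layout is an assumption for illustration.
    """
    return "https://archive.org/download/{}/{}".format(identifier, quote(filename))


# Spaces and semicolons are percent-encoded instead of breaking the URL.
print(file_url("example-item", "my file;v2.txt"))
```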

0.7.0 (2014-07-23) 

• Added support for retry on S3 503 SlowDown errors. 
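
Retrying on a 503 response can be sketched as a simple loop. The function request_with_retries and the send callable are hypothetical, and the retry count and sleep interval are illustrative defaults rather than the library's values.

```python
import time


def request_with_retries(send, retries=3, sleep_seconds=0.0):
    """Call ``send()`` until it returns a non-503 status or retries run out.

    ``send`` is a hypothetical callable returning an HTTP status code.
    """
    status = send()
    attempts_left = retries
    while status == 503 and attempts_left > 0:
        time.sleep(sleep_seconds)  # give the service time to recover
        attempts_left -= 1
        status = send()
    return status
```

In a real client the sleep would normally be non-zero so that repeated attempts do not add to the load that caused the 503.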


0.6.9 (2014-07-15)

• Added support for \n and \r characters in upload headers.

• Added support for reading filenames from stdin when using the ia delete command. 

0.6.8 (2014-07-11) 

• The delete ia subcommand is now verbose by default. 

• Added glob support to the delete ia subcommand (i.e. ia delete --glob='*jpg').

• Changed indexed metadata elements to clobber values instead of insert. 

• AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are now deprecated. IAS3_ACCESS_KEY and 
IAS3_SECRET_KEY must be used if setting IAS3 keys via environment variables. 
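
Reading the IAS3 keys from the environment variables named above can be sketched as follows. The helper keys_from_environment is hypothetical, and this is not a description of how the library itself resolves credentials.

```python
import os


def keys_from_environment(environ=os.environ):
    """Return (access_key, secret_key) from the IAS3_* variables, or None."""
    access = environ.get("IAS3_ACCESS_KEY")
    secret = environ.get("IAS3_SECRET_KEY")
    if access and secret:
        return access, secret
    return None  # the deprecated AWS_* names are deliberately not consulted
```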


Troubleshooting 

HTTPS Issues 

The internetarchive library uses the HTTPS protocol for making secure requests by default. This can cause 
issues when using versions of Python earlier than 2.7.9: 

Certain Python platforms (specifically, versions of Python earlier than 2.7.9) have restrictions in their ssl 
module that limit the configuration that urllib3 can apply. In particular, this can cause HTTPS requests 
that would succeed on more featureful platforms to fail, and can cause certain security features to be 
unavailable. 

See https://urllib3.readthedocs.org/en/latest/security.html for more details. 

If you are using a Python version earlier than 2.7.9, you might see InsecurePlatformWarning and 
SNIMissingWarning warnings and your requests might fail. There are a few options to address this issue: 

1. Upgrade your Python to version 2.7.9 or more recent. 

2. Install or upgrade the following Python modules as documented here: PyOpenSSL, ndg-httpsclient,
and pyasn1.

3. Use HTTP to make insecure requests in one of the following ways: 

• Adding the following lines to your ia.ini config file (usually located at ~/.config/ia.ini or ~/.ia.

[general]
secure = false

• In the Python interface, using a config dict:

>>> from internetarchive import get_item
>>> config = dict(general=dict(secure=False))
>>> item = get_item('<identifier>', config=config)

• In the command-line interface, use the --insecure option:

$ ia --insecure download <identifier>

OverflowError

On some 32-bit systems you may run into issues uploading files larger than 2 GB. You may see an error that looks 
something like OverflowError: long int too large to convert to int. You can get around 
this by upgrading requests: 

pip install --upgrade requests
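
To tell in advance whether an upload might run into this problem, a check along the following lines may help. It is a rough sketch under assumptions: may_hit_overflow is a hypothetical helper, and it only tests whether the interpreter's native integer limit (sys.maxsize) is the 32-bit one and whether the file is larger than that limit.

```python
import sys


def is_32bit(maxsize=sys.maxsize):
    """Return True if the interpreter's native integer limit is 32-bit."""
    return maxsize <= 2 ** 31 - 1


def may_hit_overflow(file_size, maxsize=sys.maxsize):
    """Return True if ``file_size`` exceeds the limit on a 32-bit interpreter."""
    return is_32bit(maxsize) and file_size > maxsize
```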

You can find more details about this issue at the following links: 

https://github.com/sigmavirus24/requests-toolbelt/issues/80

https://github.com/kennethreitz/requests/issues/2691


How to Contribute 


Thank you for considering contributing. All contributions are welcome and appreciated! 


Support Questions 

Please don’t use the Github issue tracker for asking support questions. All support questions should be emailed to 
info@archive.org.


Bug Reports 

Github issues is used for tracking bugs. Please consider the following when opening an issue: 

• Avoid opening duplicate issues by taking a look at the current open issues. 

• Provide details on the version, operating system and Python version you are running. 

• Include complete tracebacks and error messages. 


Pull Requests 

All pull requests and patches are welcome, but please consider the following: 

• Include tests. 

• Include documentation for new features. 

• If your patch is supposed to fix a bug, please describe in as much detail as possible the circumstances in which 
the bug happens. 

• Please follow PEP8, with the exception of what is ignored in setup.cfg. PEP8 compliancy is checked when tests 
run. Tests will fail if your patch is not PEP8 compliant. 

• Add yourself to AUTHORS.rst.

• Avoid introducing new dependencies. 

• Open an issue if a relevant one is not already open, so others have visibility into what you’re working on and 
efforts aren’t duplicated. 

• Clarity is preferred over brevity. 

Running Tests

The minimal requirements for running tests are pytest, pytest-pep8 and responses: 

$ pip install pytest pytest-pep8 responses

Clone the internetarchive lib:

$ git clone https://github.com/jjjake/internetarchive

Install the internetarchive lib as an editable package: 

$ cd internetarchive
$ pip install -e .

Run the tests: 

$ py.test --pep8

Note that this will only test against the Python version you are currently using. However, internetarchive is tested
against the multiple Python versions defined in tox.ini, and tests must pass on all of them for all pull
requests.

To test against all supported Python versions, first make sure you have all of the required versions of Python installed. 
Then install and execute tox from the root directory of the repo:

$ pip install tox 
$ tox 

Even easier is simply creating a pull request. Travis is used for continuous integration, and is set up to run the full 
test suite whenever a pull request is submitted or updated.


Authors 


The Internet Archive Python library and command-line tool is written and maintained by Jake Johnson and various 
contributors: 


Development Lead 

• Jake Johnson <jake@archive.org> 


Contributors 

• Bryce Drennan <internetarchive@brycedrennan.com> 


Patches and Suggestions 

• VM Brasseur 






Index 


D

delete() (in module internetarchive)
download() (in module internetarchive)

G

get_files() (in module internetarchive)
get_item() (in module internetarchive)
get_session() (in module internetarchive)
get_tasks() (in module internetarchive)

I

internetarchive (module)

M

modify_metadata() (in module internetarchive)

S

search_items() (in module internetarchive)

U

upload() (in module internetarchive)







